Task: Create this plot¶
1-28-26 Update
make this (or a simialar) scatterplot for E8, E10 and MEF RNA data with overlap for genes (and if you have time also TEs)
You may have better thoughts on this, but here is what I thought of so far: I first thought we could just use the MA plot we already made and add layers of direct targets to it. I asked ChatGTP for advice of how to do it (see attached). It said sth like adding a column w direct targets (peak near gene in ESCs) as TRUE/FALSE in res and then plotting it in layers w ggplot2.
I then asked how to do a scatterplot (also attached). I think the end plot is up to us (I kind pf prefer scatter like in my figure above), but the data table / tidying is the same I think[10:11 AM]
I provide you here (but you also have this in a form) with the ESC targets from the EGFP ChIP and genes nearby (this is from a homer of the bed file obtained from calling peaks of clone 2 over KO clone 21 and overlapping those with peaks of clone 3 over KO clone 21): file genes_near_ESC_peaks
[10:12 AM]and an overlap of those 3163 peaks with repeatmasker (done via bedtools intersect). overlap_3163_repeatmasker
[10:13 AM]I also provide a file where I separated all the mm10 TE names as they occur in the TEtranscripts no-cut-off output file into 3 columns so one column would allow you to match the TE overlap with the TEs in DEseq2. file: All_TE_names_im_TEtranscripts[10:14 AM]Please feed back if this makes no sense. or if you have questions.
[10:15 AM]what ChatGTP told me may not work, but it was a start to think about it. in my head you want to plot DEGs in black and indicate ChIP targets in ESCs in orange/yellow, non signfiicants will be grey. you can make one plot for genes and one for TEs
[10:16 AM]you're better in R than I am, so: have fun! Also: happy to voice call or chat if not clear or you have questions. I'll crack on with the heatmaps.[10:17 AM]I'll also ask, if I have questions.
[10:18 AM]Let's update ourselves during the afternoon how it's going to keep us accountable.
# Import libraries
suppressPackageStartupMessages({
library(DESeq2)
library(RColorBrewer)
library(dplyr)
library(tidyr)
library(plotly)
library(ggplot2)
library(stringr)
require(httr)
require(jsonlite)
library(ggrepel)
})
# Set working directory to correct directory. Change as needed
# Note: for this project, several necessary files are stored here
setwd("/data/gallegosda/GEO/RNA-seq/pca_ma_plots/final/")
Use actual chip-target data¶
peakFileAnnotated = "shared_777_mESC_peaks_EGFP_clone2_3_over_21_3163_mm10_annotated.csv"
genes_near_ESC_peaks_df <- read.csv(peakFileAnnotated)
colnames(genes_near_ESC_peaks_df)[1] <- "PeakID"
head(genes_near_ESC_peaks_df, n=3)
| PeakID | Chr | Start | End | Strand | Peak.Score | Focus.Ratio.Region.Size | Annotation | Detailed.Annotation | Distance.to.TSS | Nearest.PromoterID | Entrez.ID | Nearest.Unigene | Nearest.Refseq | Nearest.Ensembl | Gene.Name | Gene.Alias | Gene.Description | Gene.Type | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| <int> | <chr> | <int> | <int> | <chr> | <int> | <lgl> | <chr> | <chr> | <int> | <chr> | <int> | <lgl> | <chr> | <chr> | <chr> | <chr> | <chr> | <chr> | |
| 1 | 326 | chr11 | 21091301 | 21091616 | + | 0 | NA | 5' UTR (NM_023324, exon 1 of 7) | 5' UTR (NM_023324, exon 1 of 7) | 134 | NM_023324 | 67245 | NA | NM_023324 | ENSMUSG00000020134 | Peli1 | 2810468L03Rik|A930031K15Rik|D11Ertd676e | pellino 1 | protein-coding |
| 2 | 1282 | chr17 | 75551375 | 75551695 | + | 0 | NA | intron (NM_001357860, intron 1 of 6) | CpG | 411 | NM_001357860 | 72722 | NA | NM_133747 | ENSMUSG00000002017 | Fam98a | 2810405J04Rik | family with sequence similarity 98, member A | protein-coding |
| 3 | 2324 | chr6 | 39117684 | 39119019 | + | 0 | NA | promoter-TSS (NR_045813) | promoter-TSS (NR_045813) | -2 | NM_172893 | 243771 | NA | NM_172893 | ENSMUSG00000038507 | Parp12 | 9930021O16|ARTD12|PARP-12|Zc3hdc1 | poly (ADP-ribose) polymerase family, member 12 | protein-coding |
# Read a space or tab-separated file named "my_data.txt" in your working directory
te_near_ESC_peaks_df <- read.table("overlap_3163_repeatmasker.txt", header = FALSE, sep = "")
head(te_near_ESC_peaks_df, n=3)
| V1 | V2 | V3 | V4 | V5 | V6 | |
|---|---|---|---|---|---|---|
| <chr> | <int> | <int> | <chr> | <int> | <chr> | |
| 1 | chr1 | 6382841 | 6382862 | GC_rich | 21 | + |
| 2 | chr1 | 6383205 | 6383216 | GC_rich | 24 | + |
| 3 | chr1 | 9967415 | 9967442 | GC_rich | 27 | + |
colnames(te_near_ESC_peaks_df) <- c("chr", "start", "end", "repeatmaskerDesc", "val", "plusMinus")
head(te_near_ESC_peaks_df, n=3)
| chr | start | end | repeatmaskerDesc | val | plusMinus | |
|---|---|---|---|---|---|---|
| <chr> | <int> | <int> | <chr> | <int> | <chr> | |
| 1 | chr1 | 6382841 | 6382862 | GC_rich | 21 | + |
| 2 | chr1 | 6383205 | 6383216 | GC_rich | 24 | + |
| 3 | chr1 | 9967415 | 9967442 | GC_rich | 27 | + |
all_mouse_gene_ENSEMBL_IDs_and_gene_names_df <- read.csv("all_mouse_gene_ENSEMBL_IDs_and_gene_names.txt")
head(all_mouse_gene_ENSEMBL_IDs_and_gene_names_df, n=3)
| Gene.stable.ID | Gene.stable.ID.version | Transcript.stable.ID | Transcript.stable.ID.version | Gene.name | |
|---|---|---|---|---|---|
| <chr> | <chr> | <chr> | <chr> | <chr> | |
| 1 | ENSMUSG00000064336 | ENSMUSG00000064336.1 | ENSMUST00000082387 | ENSMUST00000082387.1 | mt-Tf |
| 2 | ENSMUSG00000064337 | ENSMUSG00000064337.1 | ENSMUST00000082388 | ENSMUST00000082388.1 | mt-Rnr1 |
| 3 | ENSMUSG00000064338 | ENSMUSG00000064338.1 | ENSMUST00000082389 | ENSMUST00000082389.1 | mt-Tv |
all_mouse_gene_ENSEMBL_IDs_and_gene_names_df[all_mouse_gene_ENSEMBL_IDs_and_gene_names_df$Gene.name == 'Meioc', ]
| Gene.stable.ID | Gene.stable.ID.version | Transcript.stable.ID | Transcript.stable.ID.version | Gene.name | |
|---|---|---|---|---|---|
| <chr> | <chr> | <chr> | <chr> | <chr> | |
| 212473 | ENSMUSG00000051455 | ENSMUSG00000051455.14 | ENSMUST00000156590 | ENSMUST00000156590.8 | Meioc |
| 212474 | ENSMUSG00000051455 | ENSMUSG00000051455.14 | ENSMUST00000100378 | ENSMUST00000100378.4 | Meioc |
| 212475 | ENSMUSG00000051455 | ENSMUSG00000051455.14 | ENSMUST00000155813 | ENSMUST00000155813.2 | Meioc |
nrow(all_mouse_gene_ENSEMBL_IDs_and_gene_names_df)
# Remove duplicates based only on the 'Gene.stable.ID' column, keeping the first occurrence
all_unique_mouse_genes_ids_df <- all_mouse_gene_ENSEMBL_IDs_and_gene_names_df[!duplicated(all_mouse_gene_ENSEMBL_IDs_and_gene_names_df$Gene.stable.ID), ]
nrow(all_unique_mouse_genes_ids_df)
Create a function to generate scatter plots¶
# The createScatterPlots function receives these as inputs:
# 1. Count Table file, cntTableFile -- e.g., "TEtranscripts_GRCm38_E10_777tm1d_2KO_males_vs_1WT_female_1WT_male_non_stranded.cntTable"
# 2. Number of samples in the experimental/treatment group, TGroupNum -- e.g., 2
# 3. Number of samples in the control group, CGroupNum -- e.g., 2
# 4. Number of gene names with high logfold2change values to include as text labels in plot, GeneTELabelNum -- e.g., 20
# 5. Plot width, plotWidth -- e.g., default == 6
# 6. Plot height, plotHeight -- e.g., default == 5
# 7. Plot dpi, plotDpi -- e.g., default == 300
# The createScatterPlots function returns:
# 1. Gene scatter plot
# 2. TE scatter plot
createScatterPlots <- function(cntTableFile, TGroupNum, CGroupNum, GeneTELabelNum, plotWidth = 6, plotHeight = 5, plotDpi = 300) {
# Read cntTableFile
data <- read.table(cntTableFile,header=T,row.names=1)
# Get treatment and control groups
groups <- factor(c(rep("TGroup",TGroupNum),rep("CGroup",CGroupNum)))
# Generate dds
min_read <- 1
data <- data[apply(data,1,function(x){max(x)}) > min_read,]
sampleInfo <- data.frame(groups,row.names=colnames(data))
suppressPackageStartupMessages(library(DESeq2))
dds <- DESeqDataSetFromMatrix(countData = data, colData = sampleInfo, design = ~ groups)
dds$groups = relevel(dds$groups,ref="CGroup")
dds <- DESeq(dds)
# Get res via DESeq2 results function
res <- results(dds,independentFiltering=F)
resSig <- res[(!is.na(res$padj) & (res$padj < 0.050000) & (abs(res$log2FoldChange)> 0.000000)), ]
# Normalize counts per sample
norm <- counts(dds, normalized = TRUE)
# Define treatment and control groups
grp <- colData(dds)$groups # factor with levels like TGroup / CGroup
# Mean normalized count in each condition
x_ctrl <- rowMeans(norm[, grp == "CGroup", drop=FALSE])
y_trt <- rowMeans(norm[, grp == "TGroup", drop=FALSE])
# Assemble plotting table
df <- data.frame(
id = rownames(norm),
baseMean_C = x_ctrl,
baseMean_T = y_trt,
log10_C = log10(x_ctrl + 1),
log10_T = log10(y_trt + 1),
log2FC = res$log2FoldChange,
padj = res$padj,
stringsAsFactors = FALSE
)
# Separate by Genes and TEs
# ---------------------------------------------------------------------------------------------------------
# --- Genes -----------------------------------------------------------------------------------------------
# ---------------------------------------------------------------------------------------------------------
df_genes = df[grep("ENSMUS", row.names(df)),]
# Remove dots and chars following them from id column
df_genes$id <- sub("\\..*", "", df_genes$id)
# Add chip_target boolean column to df_genes: TRUE if a gene is near an ESC ChIP-seq peak, FALSE otherwise
df_genes <- df_genes %>%
mutate(chip_target = id %in% genes_near_ESC_peaks_df$Nearest.Ensembl)
# Add gene column to store gene name via a left join with the annotated ESC ChIP-seq peak file
df_genes <- df_genes %>%
left_join(
genes_near_ESC_peaks_df %>%
select(Nearest.Ensembl, Gene.Name),
by = c("id" = "Nearest.Ensembl")
) %>%
rename(gene = Gene.Name)
# Add a DE column if significant
df_genes$DE <- (!is.na(df_genes$padj) & (df_genes$padj < 0.050000) & (abs(df_genes$log2FC)> 0.000000))
# Specify categories for plot colors
df_genes$category <- "not_DE"
df_genes$category[df_genes$DE & !df_genes$chip_target] <- "DE_no_peak"
df_genes$category[df_genes$DE & df_genes$chip_target] <- "DE_with_peak"
# Left join df_genes with all unique mouse gene names based on ENSEMBL id
df_genes <- df_genes %>%
left_join(
all_unique_mouse_genes_ids_df %>%
select(Gene.stable.ID, Gene.name),
by = c("id" = "Gene.stable.ID")
) %>%
rename(gene_ENSEMBL_join = Gene.name)
# Fill NA gene_ENSEMBL_join names with ENSEMBL id
df_genes$gene_ENSEMBL_join[is.na(df_genes$gene_ENSEMBL_join)] <-
df_genes$id[is.na(df_genes$gene_ENSEMBL_join)]
# Get top GeneLabelNum (e.g., top 20 if user inputs 20 for GeneTELabelNum) gene names in the correct order
# Subset DE genes
df_genes_DE <- subset(
df_genes,
category == "DE_with_peak" | category == "DE_no_peak"
)
sorted_df_genes_DE <- df_genes_DE %>%
arrange(desc(abs(log2FC)))
if (nrow(sorted_df_genes_DE) < GeneTELabelNum) {
print("GeneTELabelNum exceeds number of DE genes. Using max number possible instead.")
top_DE_genes <- sorted_df_genes_DE$gene_ENSEMBL_join[1:nrow(sorted_df_genes_DE)]
} else {
top_DE_genes <- sorted_df_genes_DE$gene_ENSEMBL_join[1:GeneTELabelNum]
}
label_genes <- c(top_DE_genes)
# Generate Plots
# Parse the cntTableFile for relevant labels
# -- Treatment group description
treatmentDesc <- sub(
"TEtranscripts_GRCm38_(.*)_vs_.*",
"\\1",
cntTableFile
)
# -- Control group description
controlDesc <- sub(
".*_vs_(.*)_non_stranded\\.cntTable",
"\\1",
cntTableFile
)
# -- Gene Plot --
p <- ggplot(df_genes, aes(x = log10_C, y = log10_T)) +
geom_point(data = subset(df_genes, category == "not_DE"),
alpha = 0.25, size = 0.8, color = "grey70") +
geom_point(data = subset(df_genes, category == "DE_no_peak"),
alpha = 0.8, size = 1.2, color = "black") +
geom_point(data = subset(df_genes, category == "DE_with_peak"),
alpha = 0.9, size = 1.3, color = "orange3", shape = 15) +
# geom_text(df_DE_with_peak$id) +
geom_text_repel(
data = subset(df_genes, gene_ENSEMBL_join %in% label_genes),
aes(
label = gene_ENSEMBL_join,
color = category
),
size = 2.5,
max.overlaps = Inf
) +
scale_color_manual(
values = c(
not_DE = "grey70",
DE_no_peak = "black",
DE_with_peak = "orange3"
)
) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
coord_equal() +
labs(
x = paste0(controlDesc, " log10(normalizedCount + 1)"),
y = paste0(treatmentDesc, " log10(normalizedCount + 1)"),
title = ""
) +
theme_classic()
p <- p +
labs(
title = str_wrap(
paste0(
treatmentDesc, " vs ", controlDesc,
" normalized gene differential expression top ",length(label_genes)," DE"
),
width = 40
)
)
ggsave(
paste0(treatmentDesc, "_vs_", controlDesc, "_normalized_gene_differential_expression_top_",length(label_genes),"_DE.png"),
p,
width = plotWidth,
height = plotHeight,
dpi = plotDpi
)
# Print gene plot
print(p)
# ---------------------------------------------------------------------------------------------------------
# --- TEs -------------------------------------------------------------------------------------------------
# ---------------------------------------------------------------------------------------------------------
df_te = df[-grep("ENSMUS", row.names(df)),]
df_te$te_name1 <- sub(":.*", "", df_te$id)
# Add chip_target boolean column to df_te: TRUE if a TE is near an ESC ChIP-seq peak (via RepeatMasker),
# FALSE otherwise
df_te <- df_te %>%
mutate(chip_target = te_name1 %in% te_near_ESC_peaks_df$repeatmaskerDesc)
# Note: use te_name1 column as TE name for labels
# Add a DE column if significant
df_te$DE <- (!is.na(df_te$padj) & (df_te$padj < 0.050000) & (abs(df_te$log2FC)> 0.000000))
# Specify categories for plot colors
df_te$category <- "not_DE"
df_te$category[df_te$DE & !df_te$chip_target] <- "DE_no_peak"
df_te$category[df_te$DE & df_te$chip_target] <- "DE_with_peak"
# Get top GeneLabelNum (e.g., top 20 if user inputs 20 for GeneLabelNum) gene names in the correct order
# Subset DE TEs
df_te_DE <- subset(
df_te,
category == "DE_with_peak" | category == "DE_no_peak"
)
sorted_df_te_DE <- df_te_DE %>%
arrange(desc(abs(log2FC)))
if (nrow(sorted_df_te_DE) < GeneTELabelNum) {
print("GeneTELabelNum exceeds number of DE TEs. Using max number possible instead.")
top_DE_te <- sorted_df_te_DE$te_name1[1:nrow(sorted_df_te_DE)]
} else {
top_DE_te <- sorted_df_te_DE$te_name1[1:GeneTELabelNum]
}
label_te <- c(top_DE_te)
q <- ggplot(df_te, aes(x = log10_C, y = log10_T)) +
geom_point(data = subset(df_te, category == "not_DE"),
alpha = 0.25, size = 0.8, color = "grey70") +
geom_point(data = subset(df_te, category == "DE_no_peak"),
alpha = 0.8, size = 1.2, color = "black") +
geom_point(data = subset(df_te, category == "DE_with_peak"),
alpha = 0.9, size = 1.3, color = "orange3", shape = 15) +
# geom_text(df_DE_with_peak$id) +
geom_text_repel(
data = subset(df_te, te_name1 %in% label_te),
aes(
label = te_name1,
color = category
),
size = 2.5,
max.overlaps = Inf
) +
scale_color_manual(
values = c(
not_DE = "grey70",
DE_no_peak = "black",
DE_with_peak = "orange3"
)
) +
geom_abline(slope = 1, intercept = 0, linetype = "dashed") +
coord_equal() +
labs(
x = paste0(controlDesc, " log10(normalizedCount + 1)"),
y = paste0(treatmentDesc, " log10(normalizedCount + 1)"),
title = ""
) +
theme_classic()
q <- q +
labs(
title = str_wrap(
paste0(
treatmentDesc, " vs ", controlDesc,
" normalized transposable element differential expression top ",length(label_te)," DE"
),
width = 40
)
)
ggsave(
paste0(treatmentDesc, "_vs_", controlDesc, "_normalized_transposable_element_differential_expression_top_",length(label_te),"_DE.png"),
q,
width = plotWidth,
height = plotHeight,
dpi = plotDpi
)
# Print TE plot
print(q)
}
createScatterPlots("TEtranscripts_GRCm38_E10_777tm1d_2KO_males_vs_1WT_female_1WT_male_non_stranded.cntTable",2,2,15)
estimating size factors estimating dispersions gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates fitting model and testing
[1] "GeneTELabelNum exceeds number of DE TEs. Using max number possible instead."
createScatterPlots("TEtranscripts_GRCm38_E8_777tm1a_KO_vs_WT_females_non_stranded.cntTable",10,5,15)
estimating size factors estimating dispersions gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates fitting model and testing -- replacing outliers and refitting for 105 genes -- DESeq argument 'minReplicatesForReplace' = 7 -- original counts are preserved in counts(dds) estimating dispersions fitting model and testing
[1] "GeneTELabelNum exceeds number of DE TEs. Using max number possible instead."
Warning message: “No shared levels found between `names(values)` of the manual scale and the data's colour values.” Warning message: “No shared levels found between `names(values)` of the manual scale and the data's colour values.”
createScatterPlots("TEtranscripts_GRCm38_E8_777tm1a_KO_vs_WT_males_non_stranded.cntTable",3,5,15)
estimating size factors estimating dispersions gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates fitting model and testing
[1] "GeneTELabelNum exceeds number of DE TEs. Using max number possible instead."
Warning message: “No shared levels found between `names(values)` of the manual scale and the data's colour values.” Warning message: “No shared levels found between `names(values)` of the manual scale and the data's colour values.”
createScatterPlots("TEtranscripts_GRCm38_mESC_R1_777KO_EGFP_excised_vs_WT_non_stranded.cntTable",2,2,15)
estimating size factors estimating dispersions gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates fitting model and testing
[1] "GeneTELabelNum exceeds number of DE TEs. Using max number possible instead."
createScatterPlots("TEtranscripts_GRCm38_mF9_OE_pCMV6_777-HA_vs_pSB_mock_non_stranded.cntTable",2,2,15)
estimating size factors estimating dispersions gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates fitting model and testing
[1] "GeneTELabelNum exceeds number of DE TEs. Using max number possible instead."
createScatterPlots("TEtranscripts_GRCm38_mF9_OE_pCMV6_777-R297W-HA_vs_pSB_mock_non_stranded.cntTable",2,2,15)
estimating size factors estimating dispersions gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates fitting model and testing
[1] "GeneTELabelNum exceeds number of DE TEs. Using max number possible instead."
createScatterPlots("TEtranscripts_GRCm38_mMEF_E13_DUFKZFP_cluster_KO_vs_WT_males_non_stranded.cntTable",2,2,15)
estimating size factors estimating dispersions gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates fitting model and testing
[1] "GeneTELabelNum exceeds number of DE TEs. Using max number possible instead."
Warning message: “No shared levels found between `names(values)` of the manual scale and the data's colour values.” Warning message: “No shared levels found between `names(values)` of the manual scale and the data's colour values.”
createScatterPlots("TEtranscripts_GRCm38_mMEF_E15_777-R297W-KI_vs_E15_WT_males_non_stranded.cntTable",3,2,15)
estimating size factors estimating dispersions gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates fitting model and testing
createScatterPlots("TEtranscripts_GRCm38_mMEF_E15_777-R297W-KI_vs_E16_WT_females_non_stranded.cntTable",3,2,15)
estimating size factors estimating dispersions gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates fitting model and testing
createScatterPlots("TEtranscripts_GRCm38_mMEF_E15_777KO_tm1a_vs_WT_females_non_stranded.cntTable",2,2,15)
estimating size factors estimating dispersions gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates fitting model and testing
[1] "GeneTELabelNum exceeds number of DE TEs. Using max number possible instead."
createScatterPlots("TEtranscripts_GRCm38_mMEF_E15_777KO_tm1a_vs_WT_males_non_stranded.cntTable",2,2,15)
estimating size factors estimating dispersions gene-wise dispersion estimates mean-dispersion relationship final dispersion estimates fitting model and testing